Sequential-Access FM-Indexes
Abstract
Previous authors have shown how to build FM-indexes efficiently in external memory, but querying them efficiently remains an open problem. Searching naïvely for a pattern P requires Θ(|P|) random accesses. In this paper we show how, by storing a few small auxiliary tables, we can access data only in the order in which they appear on disk, which should be faster.

An FM-index [4] is a compressed representation of a text that allows us to quickly search for arbitrary patterns in that text. The growing popularity of FM-indexes in genomics (e.g., in BWT-SW, Bowtie, SOAP2 and BWA) means we should look for ways in which they can handle massive datasets, which may have to reside in external memory even when compressed. Unfortunately, although we know how to build FM-indexes efficiently in external memory [3], querying them efficiently remains an open problem. Searching naïvely for a pattern P requires Θ(|P|) random accesses, which are expensive due to seek times. We refer the reader to the papers by Chien et al. [1], Hon et al. [5] and Ferragina [2] for more discussion of this problem. In this paper we extend a result by Orlandi and Venturini [6] to show how, by storing a few small auxiliary tables, we can access data only in the order in which they appear on disk. We may read slightly more data but, since sequential access to disk is orders of magnitude faster than random access, our modified index should be faster overall.

FM-indexes are based on the Burrows-Wheeler Transform (BWT), which permutes the characters of a string T based on the contexts that follow them. We can compute B = BWT(T) by lexicographically sorting the rotations of T, then recording the last character of each rotation. (If we want to recover T later, we append a special symbol before computing B or record the position to which a designated character is mapped.) For example, if T = 110111100101110101010001111, then B = 110111011001001011111010110.
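The rotation-sorting definition of the BWT can be sketched directly in Python. This is a minimal in-memory illustration, not the external-memory construction of [3]: the function name `bwt` is hypothetical, no end marker is appended (T is treated as cyclic, as in the example), and sorting full rotations takes quadratic space, so it only suits short strings.

```python
def bwt(text: str) -> str:
    """Return the BWT of `text`, treating it as cyclic (no end marker)."""
    n = len(text)
    # Build all n rotations of the text and sort them lexicographically.
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


T = "110111100101110101010001111"
B = bwt(T)
print(B)  # prints 110111011001001011111010110
```

The printed string agrees with the value of B given in the example.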
We use binary strings for simplicity, but the results in this paper extend to any reasonable alphabet size. Notice that, for any pattern P, the characters immediately preceding occurrences of P in T are adjacent in B (considering T to be cyclic). For example, if P = 0101 then the characters immediately preceding occurrences of P are T[8], T[14] and T[16], which are mapped to B[7], B[6] and B[5], respectively. We call B[5..7] the interval for P = 0101.

The basic operation of FM-indexes is to find the interval in B for any given pattern P. The length of the interval, for example, is the number of occurrences of P in T. Notice that the left endpoint of the interval is the rank of the lexicographically first rotation of T that starts with P, and the right endpoint is the rank of the lexicographically last such rotation. To find these endpoints, we store data structures such that, for any character c in the alphabet and any position i in B, we can quickly compute the number rank_c(i) of occurrences of c in B[1..i]. We also store the number C[c] of characters in B lexicographically less than c.

Suppose we are naïvely searching for the right endpoint of the interval; finding the left endpoint is essentially symmetric. We iteratively compute j_1 = rank_{P[|P|]}(|B|) + C[
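As a rough illustration of how rank and C locate an interval, the following is a minimal in-memory sketch of the textbook backward-search recurrence. It is an assumption that this matches the paper's formulas, and it does not implement the sequential-access technique. The names `build_tables` and `interval` are hypothetical, the rank table is a plain uncompressed array, and positions are 1-based to match the text.

```python
from collections import Counter


def build_tables(b: str):
    """Build C[c] and a prefix-count table rank[c][i] for the string b."""
    counts = Counter(b)
    # C[c]: number of characters in b lexicographically smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # rank[c][i]: occurrences of c in b[1..i] (1-based), with rank[c][0] = 0.
    rank = {ch: [0] * (len(b) + 1) for ch in counts}
    for i, ch in enumerate(b, start=1):
        for c in rank:
            rank[c][i] = rank[c][i - 1] + (1 if c == ch else 0)
    return C, rank


def interval(b: str, pattern: str):
    """Return the 1-based interval (lo, hi) for pattern in b, or None."""
    C, rank = build_tables(b)
    lo, hi = 1, len(b)
    # Process the pattern from its last character to its first.
    for c in reversed(pattern):
        if c not in C:
            return None
        lo = C[c] + rank[c][lo - 1] + 1  # left endpoint
        hi = C[c] + rank[c][hi]          # right endpoint
        if lo > hi:
            return None  # the pattern does not occur
    return lo, hi


B = "110111011001001011111010110"
print(interval(B, "0101"))  # prints (5, 7)
```

For P = 0101 the result (5, 7) corresponds to the interval B[5..7] from the example, and its length 3 is the number of occurrences of P in T. Each step reads rank at positions that depend on the previous step, which is the source of the Θ(|P|) random accesses noted earlier.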
Journal: CoRR
Volume: abs/1205.1195
Year of publication: 2012